The NBA Player Statistics dataset [1] provides a comprehensive compilation of performance metrics for players across multiple seasons of the National Basketball Association (NBA). This dataset encompasses a wide range of statistics that reflect player performance in various aspects of the game, including offensive and defensive skills.
Scope: The data captures detailed player statistics such as points per game, assists, rebounds, steals, and blocks, among others. This allows for a multifaceted analysis of player contributions and effectiveness on the court.
Utility: Analysts and enthusiasts can use this dataset to evaluate player performance trends, develop predictive models for future performances, and compare players across different seasons and team compositions.
Applications: Beyond individual player analysis, the dataset serves as a foundational tool for team strategy development, game analytics, and in-depth research into the dynamics of professional basketball.
Positions in basketball:
This dataset has been modified to show only the five standard basketball positions: Center (C), Power Forward (PF), Point Guard (PG), Small Forward (SF), and Shooting Guard (SG). Some players play more than one position and are labelled position1-position2 in the raw data (for example, SG-SF). For these players we keep only the primary position, position1.
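As a minimal sketch of the primary-position rule described above, the part of each label after the first hyphen can be dropped with base R's `sub()`. The `pos` vector here is illustrative, not taken from the data file.

```r
# Illustrative position labels, including combined ones
pos <- c("PG", "SG-SF", "PF-C", "C")

# Remove the first hyphen and everything after it, keeping position1
primary <- sub("-.*$", "", pos)

primary
```

Unlike `separate()`, this approach does not create a second column and does not warn when a label has no hyphen.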
The following is what the columns of data stand for:
Rk: Rank of the player (integer)
Player: Name of the player (character)
Pos: Position of the player (factor with 5 levels: “C”, “PF”, “PG”, “SF”, “SG”)
Age: Age of the player (integer)
Tm: Team of the player (factor with 38 levels)
G: Number of games the player was in (integer)
GS: Number of games the player started (integer)
MP: Minutes played per game (numeric)
FG: Field goals made per game (numeric)
FGA: Field goal attempts per game (numeric)
FG.: Field goal percentage (numeric)
X3P: 3-point field goals made per game (numeric)
X3PA: 3-point field goal attempts per game (numeric)
X3P.: 3-point field goal percentage (numeric)
X2P: 2-point field goals made per game (numeric)
X2PA: 2-point field goal attempts per game (numeric)
X2P.: 2-point field goal percentage (numeric)
eFG.: Effective field goal percentage (numeric)
FT: Free throws made per game (numeric)
FTA: Free throw attempts per game (numeric)
FT.: Free throw percentage (numeric)
ORB: Offensive rebounds per game (numeric)
DRB: Defensive rebounds per game (numeric)
TRB: Total rebounds per game (numeric)
AST: Assists per game (numeric)
STL: Steals per game (numeric)
BLK: Blocks per game (numeric)
TOV: Turnovers per game (numeric)
PF: Personal fouls per game (numeric)
PTS: Points per game (numeric)
Season: The season of the record (character)
MVP: Whether the player was the Most Valuable Player (factor with 2 levels: “False”, “True”)
#data location
setwd("C:/Users/racha/Desktop/STAT 515")
#setwd("/Users/karar/Documents/Mason/DataAnalyticsMasters/STAT515/Final Project/STAT515_Final_Project/")
library(dplyr)
library(caret)
library(ggplot2)
library(GGally)
library(plotly)
library(tidyverse)
library(randomForest)
library(reshape2)
nba_data = read.csv("nba.csv") # Ensure the file path is correct
#nba_data = read.csv("NBA_Player_Stats_2.csv")
print(colnames(nba_data))
## [1] "Rk" "Player" "Pos" "Age" "Tm" "G" "GS" "MP"
## [9] "FG" "FGA" "FG." "X3P" "X3PA" "X3P." "X2P" "X2PA"
## [17] "X2P." "eFG." "FT" "FTA" "FT." "ORB" "DRB" "TRB"
## [25] "AST" "STL" "BLK" "TOV" "PF" "PTS" "Season" "MVP"
summary(nba_data)
## Rk Player Pos Age
## Min. : 1.0 Length:14573 Length:14573 Min. :18.00
## 1st Qu.:124.0 Class :character Class :character 1st Qu.:23.00
## Median :243.0 Mode :character Mode :character Median :26.00
## Mean :244.3 Mean :26.71
## 3rd Qu.:361.0 3rd Qu.:30.00
## Max. :605.0 Max. :44.00
##
## Tm G GS MP
## Length:14573 Min. : 1.00 Min. : 0.00 Min. : 0.00
## Class :character 1st Qu.:22.00 1st Qu.: 0.00 1st Qu.:11.40
## Mode :character Median :48.00 Median : 7.00 Median :18.90
## Mean :45.54 Mean :21.57 Mean :19.62
## 3rd Qu.:70.00 3rd Qu.:39.00 3rd Qu.:27.70
## Max. :85.00 Max. :83.00 Max. :43.70
##
## FG FGA FG. X3P
## Min. : 0.000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.: 1.300 1st Qu.: 3.100 1st Qu.:0.3930 1st Qu.:0.0000
## Median : 2.400 Median : 5.500 Median :0.4350 Median :0.3000
## Mean : 2.932 Mean : 6.599 Mean :0.4324 Mean :0.5909
## 3rd Qu.: 4.100 3rd Qu.: 9.200 3rd Qu.:0.4790 3rd Qu.:1.0000
## Max. :12.200 Max. :27.800 Max. :1.0000 Max. :5.3000
## NA's :88
## X3PA X3P. X2P X2PA
## Min. : 0.000 Min. :0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.100 1st Qu.:0.2220 1st Qu.: 1.000 1st Qu.: 2.100
## Median : 1.100 Median :0.3260 Median : 1.800 Median : 3.900
## Mean : 1.704 Mean :0.2843 Mean : 2.342 Mean : 4.895
## 3rd Qu.: 2.800 3rd Qu.:0.3750 3rd Qu.: 3.300 3rd Qu.: 6.800
## Max. :13.200 Max. :1.0000 Max. :12.100 Max. :23.400
## NA's :2198
## X2P. eFG. FT FTA
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.4230 1st Qu.:0.4380 1st Qu.: 0.500 1st Qu.: 0.700
## Median :0.4700 Median :0.4830 Median : 1.000 Median : 1.400
## Mean :0.4648 Mean :0.4735 Mean : 1.401 Mean : 1.872
## 3rd Qu.:0.5140 3rd Qu.:0.5240 3rd Qu.: 1.900 3rd Qu.: 2.500
## Max. :1.0000 Max. :1.5000 Max. :10.300 Max. :13.100
## NA's :154 NA's :88
## FT. ORB DRB TRB
## Min. :0.0000 Min. :0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.:0.6600 1st Qu.:0.30 1st Qu.: 1.300 1st Qu.: 1.70
## Median :0.7500 Median :0.70 Median : 2.200 Median : 2.90
## Mean :0.7262 Mean :0.91 Mean : 2.522 Mean : 3.43
## 3rd Qu.:0.8220 3rd Qu.:1.30 3rd Qu.: 3.300 3rd Qu.: 4.60
## Max. :1.0000 Max. :6.80 Max. :12.000 Max. :18.00
## NA's :749
## AST STL BLK TOV
## Min. : 0.000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.: 0.500 1st Qu.:0.3000 1st Qu.:0.1000 1st Qu.:0.600
## Median : 1.200 Median :0.5000 Median :0.2000 Median :1.000
## Mean : 1.758 Mean :0.6215 Mean :0.3902 Mean :1.132
## 3rd Qu.: 2.300 3rd Qu.:0.9000 3rd Qu.:0.5000 3rd Qu.:1.500
## Max. :12.800 Max. :3.0000 Max. :6.0000 Max. :5.700
##
## PF PTS Season MVP
## Min. :0.000 Min. : 0.000 Length:14573 Length:14573
## 1st Qu.:1.200 1st Qu.: 3.400 Class :character Class :character
## Median :1.800 Median : 6.400 Mode :character Mode :character
## Mean :1.782 Mean : 7.853
## 3rd Qu.:2.400 3rd Qu.:11.100
## Max. :6.000 Max. :36.100
##
str(nba_data)
## 'data.frame': 14573 obs. of 32 variables:
## $ Rk : int 1 2 3 4 4 4 5 6 7 8 ...
## $ Player: chr "Mahmoud Abdul-Rauf" "Tariq Abdul-Wahad" "Shareef Abdur-Rahim" "Cory Alexander" ...
## $ Pos : chr "PG" "SG" "SF" "PG" ...
## $ Age : int 28 23 21 24 24 24 22 23 33 27 ...
## $ Tm : chr "SAC" "SAC" "VAN" "TOT" ...
## $ G : int 31 59 82 60 37 23 82 66 50 61 ...
## $ GS : int 0 16 82 22 3 19 82 13 0 56 ...
## $ MP : num 17.1 16.3 36 21.6 13.5 34.7 40.1 27.9 8 30.5 ...
## $ FG : num 3.3 2.4 8 2.9 1.6 4.8 6.9 3.6 0.7 4.4 ...
## $ FGA : num 8.8 6.1 16.4 6.7 3.9 11.1 16 8.9 1.6 11 ...
## $ FG. : num 0.377 0.403 0.485 0.428 0.414 0.435 0.428 0.408 0.444 0.398 ...
## $ X3P : num 0.2 0.1 0.3 1.1 0.5 2 1.6 0.3 0 0.9 ...
## $ X3PA : num 1 0.3 0.6 2.9 1.7 4.9 4.5 1.3 0.1 2.6 ...
## $ X3P. : num 0.161 0.211 0.412 0.375 0.313 0.411 0.364 0.202 0 0.356 ...
## $ X2P : num 3.2 2.4 7.7 1.8 1.1 2.8 5.2 3.4 0.7 3.5 ...
## $ X2PA : num 7.8 5.7 15.8 3.7 2.2 6.2 11.5 7.6 1.5 8.4 ...
## $ X2P. : num 0.405 0.414 0.488 0.469 0.494 0.455 0.453 0.442 0.474 0.411 ...
## $ eFG. : num 0.386 0.409 0.493 0.51 0.483 0.525 0.479 0.422 0.444 0.44 ...
## $ FT : num 0.5 1.4 6.1 1.3 0.7 2.4 4.2 4.2 0.3 2.5 ...
## $ FTA : num 0.5 2.1 7.8 1.7 1 2.8 4.8 4.8 0.8 3.2 ...
## $ FT. : num 1 0.672 0.784 0.784 0.676 0.846 0.875 0.873 0.39 0.789 ...
## $ ORB : num 0.2 0.7 2.8 0.3 0.2 0.4 1.5 0.8 0.8 0.6 ...
## $ DRB : num 1 1.2 4.3 2.2 1.1 3.9 3.4 2 1.6 2.2 ...
## $ TRB : num 1.2 2 7.1 2.4 1.3 4.3 4.9 2.8 2.4 2.8 ...
## $ AST : num 1.9 0.9 2.6 3.5 1.9 6 4.3 3.4 0.3 5.7 ...
## $ STL : num 0.5 0.6 1.1 1.2 0.7 2 1.4 1.3 0.4 1.4 ...
## $ BLK : num 0 0.2 0.9 0.2 0.1 0.3 0.1 0.2 0.2 0 ...
## $ TOV : num 0.6 1.1 3.1 1.9 1.3 2.8 3.2 1.9 0.3 2.3 ...
## $ PF : num 1 1.4 2.5 1.6 1.4 2 3 2.1 1.7 2.2 ...
## $ PTS : num 7.3 6.4 22.3 8.1 4.5 14 19.5 11.7 1.8 12.2 ...
## $ Season: chr "1997-98" "1997-98" "1997-98" "1997-98" ...
## $ MVP : chr "False" "False" "False" "False" ...
# Handling missing values
nba_data = na.omit(nba_data)
# Pre-process the data: Convert factors
nba_data$Tm = as.factor(nba_data$Tm)
nba_data$MVP = as.factor(nba_data$MVP)
nba_data = nba_data %>% separate(Pos, into = c("Pos", "Pos2"), sep = "-")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 11764 rows [1, 2, 3, 4,
## 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
nba_data = nba_data %>% subset(select = -Pos2)
nba_data$Pos = as.factor(nba_data$Pos)
nba_data_filtered = nba_data %>%
filter(G > 20)
Once this operation is executed, the resulting dataset (nba_data_filtered) will contain only those players who have played more than 20 games. This filtered dataset is likely to have fewer outliers in performance metrics caused by small sample sizes.
nba_data_filtered = nba_data_filtered %>%
mutate(`FG%` = `FG.` / 100,
`3P%` = `X3P.` / 100,
`2P%` = `X2P.` / 100,
`eFG%` = `eFG.` / 100,
`FT%` = `FT.` / 100) %>%
select(-c(`FG.`, `X3P.`, `X2P.`, `eFG.`, `FT.`))
Once this snippet is executed, the shooting percentage columns are renamed to FG%, 3P%, 2P%, eFG%, and FT%, and the original columns are dropped. Note that the source columns are already stored as proportions between 0 and 1 (see the summary output above, where FG. has a mean of 0.4324), so dividing by 100 rescales them to roughly 0 to 0.01. This does not change model fit, but it multiplies every regression coefficient on these columns by 100, which should be kept in mind when reading the coefficient estimates that follow.
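The effect of dividing a predictor by 100 can be checked on simulated data. The sketch below uses made-up variables (`fg`, `pts`) and is not based on the NBA file: the slope should grow by a factor of 100 while R-squared stays the same.

```r
set.seed(1)
n   <- 200
fg  <- runif(n, 0.3, 0.6)            # simulated proportions on a 0-1 scale
pts <- 2 + 20 * fg + rnorm(n)        # simulated response

fit_raw    <- lm(pts ~ fg)
fg_scaled  <- fg / 100               # same rescaling as in the mutate() step
fit_scaled <- lm(pts ~ fg_scaled)

# Ratio of the two slopes and the two R-squared values
ratio     <- unname(coef(fit_scaled)["fg_scaled"] / coef(fit_raw)["fg"])
r2_raw    <- summary(fit_raw)$r.squared
r2_scaled <- summary(fit_scaled)$r.squared

c(ratio = ratio, r2_raw = r2_raw, r2_scaled = r2_scaled)
```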
library(ggplot2)
ggplot(nba_data, aes(x = PTS)) +
geom_histogram(bins = 30, fill = "blue", color = "black") +
labs(title = "Distribution of Points Per Game", x = "Points Per Game", y = "Frequency")
ggplot(nba_data, aes(x = Pos, y = PTS, fill = Pos)) +
geom_boxplot() +
labs(title = "Points Per Game by Player Position", x = "Position", y = "Points Per Game")
The histogram shows that the distribution of points per game is right-skewed, meaning most players score on the lower end of the scale, with fewer players averaging high points per game. This is typical in sports data where only a few top performers reach the higher end of the scoring spectrum. The peak of the distribution is around 2 to 6 points per game, indicating that this range is the most common scoring output among players.
The boxplot reveals several interesting points about scoring across different positions:
Variability: There’s a notable variation in median points per game among positions. For instance, positions like Shooting Guard (SG) and Point Guard (PG) typically have higher medians and wider interquartile ranges, suggesting these positions are likely to score more.
Outliers: Several positions show outliers, especially in scoring roles like SG and PG, indicating some players in these positions significantly outscore their peers.
Positional Roles: The plot shows the differences in scoring roles within teams, where guards generally score more than forwards and centers. This could be indicative of the offensive responsibilities typically assigned to these positions in basketball.
These insights can be particularly useful for team strategy, indicating which positions might require more focus in training for scoring or recruitment to balance team capabilities. They also provide a foundational understanding for further statistical testing, such as comparing means across groups or correlating scoring with other factors like age or experience.
reg_model = lm(PTS ~ `FG%` + `3P%` + `2P%` + `eFG%` + `FT%` + AST + TRB + Pos, data = nba_data_filtered)
summary(reg_model)
##
## Call:
## lm(formula = PTS ~ `FG%` + `3P%` + `2P%` + `eFG%` + `FT%` + AST +
## TRB + Pos, data = nba_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6594 -1.8248 -0.1881 1.6152 17.4742
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.90504 0.37652 -34.274 < 2e-16 ***
## `FG%` 883.47516 125.94807 7.015 2.45e-12 ***
## `3P%` 276.85679 28.17634 9.826 < 2e-16 ***
## `2P%` -387.96409 94.98483 -4.084 4.45e-05 ***
## `eFG%` 489.11420 117.26342 4.171 3.06e-05 ***
## `FT%` 964.34134 31.09663 31.011 < 2e-16 ***
## AST 1.51059 0.02458 61.447 < 2e-16 ***
## TRB 1.31314 0.01980 66.336 < 2e-16 ***
## PosPF 1.19888 0.11316 10.594 < 2e-16 ***
## PosPG 0.34426 0.16581 2.076 0.0379 *
## PosSF 2.60046 0.12777 20.353 < 2e-16 ***
## PosSG 3.40874 0.14009 24.333 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.099 on 10032 degrees of freedom
## Multiple R-squared: 0.714, Adjusted R-squared: 0.7137
## F-statistic: 2277 on 11 and 10032 DF, p-value: < 2.2e-16
Coefficients and Significance:
FG%, 3P%, eFG%, FT%: Significant positive coefficients for these variables indicate that higher shooting efficiencies (field goal, three-point, effective field goal, and free throw percentages) are associated with higher points per game. FT% and FG% have the largest coefficients, but these magnitudes partly reflect the division by 100 applied to the percentage columns, so they are not directly comparable with the per-game predictors.
2P%: This has a significant negative coefficient, which most likely reflects multicollinearity, given its correlation with FG% and eFG%.
Assists (AST) and Total Rebounds (TRB): Both have significant positive impacts on scoring, underscoring the value of players who contribute beyond just shooting.
Player Position (Pos):
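Positions like SG (Shooting Guard) and SF (Small Forward) show significant positive coefficients, indicating that these positions typically score more points than the baseline position. The baseline is Center (C): it is the first factor level alphabetically and therefore has no coefficient of its own in the output. All four position coefficients are positive relative to Center, holding the other predictors fixed.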
The coefficients for different positions highlight the scoring dynamics associated with each role on the court, with guards and forwards often contributing more to scoring.
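How R picks the baseline level can be illustrated with a small factor. This is a sketch on a made-up vector, not the project data. By default the first level in alphabetical order is the reference, and `relevel()` changes it.

```r
# Illustrative positions; levels are sorted alphabetically by factor()
pos <- factor(c("PG", "SG", "SF", "PF", "C", "C"))

baseline <- levels(pos)[1]                   # reference level used by lm()
dummy_cols <- colnames(model.matrix(~ pos))  # one column per non-baseline level

# Choose Point Guard as the reference instead
pos_pg <- relevel(pos, ref = "PG")
new_baseline <- levels(pos_pg)[1]

list(baseline = baseline, dummy_cols = dummy_cols, new_baseline = new_baseline)
```

Refitting the model after `relevel()` changes the position coefficients, since each one is then a contrast against the new reference, but leaves the fitted values unchanged.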
Model Fit:
R-squared (0.714): About 71.4% of the variability in points scored is explained by the model, which is quite high, suggesting a good fit.
Adjusted R-squared (0.7137): This is very close to the R-squared value, indicating that the number of predictors in the model is justified given the amount of data.
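One way to probe the multicollinearity suspected above is a variance inflation factor (VIF), computed as 1 / (1 - R²) from regressing each predictor on the others. The sketch below uses simulated predictors (`fg`, `efg`, `ast` are hypothetical stand-ins) and base R only. Applying it to the actual model would require the real columns.

```r
set.seed(42)
n   <- 500
fg  <- runif(n, 0.35, 0.55)
efg <- fg + rnorm(n, mean = 0.03, sd = 0.01)  # nearly collinear with fg by construction
ast <- rexp(n, rate = 1 / 2)                  # unrelated predictor
X   <- data.frame(fg, efg, ast)

# VIF for each predictor: regress it on all the others
vif <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

round(vif, 2)
```

A common rule of thumb treats values above 5 or 10 as a sign that coefficient signs and sizes may be unstable.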
position_summary = nba_data_filtered %>%
group_by(Pos) %>%
summarise(
Avg_Points = mean(PTS),
Avg_AST = mean(AST),
Avg_TRB = mean(TRB),
Avg_FG_Percentage = mean(`FG%`), # Changed variable name
Avg_3P_Percentage = mean(`3P%`), # Changed variable name
Avg_2P_Percentage = mean(`2P%`), # Include if needed
Avg_eFG_Percentage = mean(`eFG%`), # Include if needed
Avg_FT_Percentage = mean(`FT%`) # Include if needed
)
print(position_summary)
## # A tibble: 5 × 9
## Pos Avg_Points Avg_AST Avg_TRB Avg_FG_Percentage Avg_3P_Percentage
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 C 8.96 1.26 6.05 0.00502 0.00171
## 2 PF 9.43 1.42 5.16 0.00457 0.00252
## 3 PG 9.50 4.02 2.50 0.00417 0.00324
## 4 SF 9.59 1.68 3.69 0.00433 0.00322
## 5 SG 10.1 2.13 2.76 0.00421 0.00339
## # ℹ 3 more variables: Avg_2P_Percentage <dbl>, Avg_eFG_Percentage <dbl>,
## # Avg_FT_Percentage <dbl>
The resulting table from this snippet will show the average points, assists, rebounds, and shooting percentages for each position. This can provide insights into:
Offensive and Defensive Roles: Which positions contribute more to scoring or playmaking? Are certain positions more rebound-intensive?
Shooting Efficiency: Which positions have higher shooting percentages? This can indicate specialized training or positional roles in shooting.
coef_data = as.data.frame(summary(reg_model)$coefficients)
ggplot(coef_data, aes(x = rownames(coef_data), y = Estimate, fill = Estimate)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(title = "Regression Coefficients: Predicting Points Per Game", x = "Predictors", y = "Coefficient Estimate")
The chart illustrates the coefficient estimate of each predictor of points per game according to the regression model. Here are some key observations:
High Positive Impact:
FT% (Free Throw Percentage): Shows the largest positive coefficient, indicating that a higher free throw percentage is associated with more points per game.
FG% (Field Goal Percentage): Also shows a substantial positive coefficient, which is expected, as making more field goals directly contributes to higher scoring.
eFG% (Effective Field Goal Percentage): This coefficient is also positive but smaller than those of FG% and FT%, which might be due to the way it is calculated (it gives extra weight to three-point field goals).
Negative Impact:
2P% (Two-Point Percentage): This predictor shows a negative coefficient, which could suggest a substitution effect with other types of shots (like three-pointers) or may indicate multicollinearity with the other shooting percentage variables.
Positions:
Position indicators (such as PosSF and PosSG) show varying levels of impact, with Shooting Guard (PosSG) and Small Forward (PosSF) showing the largest positive coefficients, indicating that these positions, typically scoring roles, are likely to score more points.
Other Stats:
Assists (AST) and Total Rebounds (TRB): Both have positive coefficients that look small next to the shooting percentages on the chart. Because the percentage columns were divided by 100 while AST and TRB are per-game counts, the bar heights are not directly comparable. AST and TRB in fact have the largest t values in the model (61.447 and 66.336).
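Raw coefficients depend on the units of each predictor, so one option for comparing them is to standardize all variables before fitting. The sketch below uses simulated data (`ft`, `ast`, and `pts` are hypothetical) in which a predictor with a tiny range gets a large raw coefficient but a small standardized one.

```r
set.seed(7)
n   <- 300
ft  <- runif(n, 0.005, 0.009)               # predictor with a very small range
ast <- rexp(n, rate = 1 / 2)                # per-game style predictor
pts <- 1 + 500 * ft + 1.5 * ast + rnorm(n)  # simulated response
d   <- data.frame(pts, ft, ast)

# Coefficients in original units
raw <- coef(lm(pts ~ ft + ast, data = d))

# Coefficients after centring and scaling every column to unit variance
std <- coef(lm(pts ~ ft + ast, data = as.data.frame(scale(d))))

rbind(raw = raw[c("ft", "ast")], standardized = std[c("ft", "ast")])
```

In this simulation the raw coefficient on `ft` is far larger than the one on `ast`, while the standardized coefficients rank them the other way round.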
Calculating and Visualizing Correlation Matrix
cor_matrix = cor(nba_data_filtered[, c("PTS", "AST", "TRB", "FG%", "3P%", "2P%", "eFG%", "FT%")])
library(reshape2)
melted_cor_matrix = melt(cor_matrix)
ggplot(melted_cor_matrix, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0, limit = c(-1,1), space = "Lab", name="Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
axis.title = element_blank())
Points (PTS) Correlations:
High positive correlations with field goal percentage (FG%), effective field goal percentage (eFG%), and free throw percentage (FT%). This indicates that players who have higher shooting efficiencies tend to score more points.
Moderate positive correlation with assists (AST) and total rebounds (TRB), suggesting that players who are more involved in the game (either through passing or rebounding) also tend to score more.
Assists (AST):
Positively correlated with FG%, 3P%, and eFG%. This could imply that players who assist more are involved in plays that lead to effective shooting, possibly indicating that good playmaking leads to more efficient scoring opportunities.
Total Rebounds (TRB):
Correlated with FG% and 2P% but less so with 3P%. This might reflect that players who are good at rebounding are often in positions to make two-point shots (perhaps due to being closer to the basket).
Shooting Percentages (FG%, 3P%, 2P%, eFG%, FT%):
The correlations among different types of shooting percentages are generally high, which is expected as they are not independent of each other. For instance, eFG%, which accounts for the fact that three-point field goals count more than two-point field goals, is highly correlated with both FG% and 3P%.
Free throw percentage (FT%) shows strong correlations with FG% and eFG%, suggesting that players who are good shooters generally perform well across different types of shooting.
Negative Correlations:
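Negative pairs can be read off a heatmap by eye, but they can also be listed directly from the correlation matrix. The sketch below does this on R's built-in `mtcars` data as a stand-in. The column choice is arbitrary and unrelated to the NBA variables.

```r
# Correlation matrix for a few built-in columns
cm <- cor(mtcars[, c("mpg", "wt", "hp", "qsec")])

# Long format: one row per (Var1, Var2) pair with its correlation r
pairs <- as.data.frame(as.table(cm), responseName = "r")

# Keep each unordered pair once and drop the diagonal
pairs <- pairs[as.character(pairs$Var1) < as.character(pairs$Var2), ]

# Negative correlations, strongest first
neg <- pairs[pairs$r < 0, ]
neg[order(neg$r), ]
```

The same steps should apply to `cor_matrix` from the chunk above, since it is also a plain numeric matrix.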
predictive_model = lm(PTS ~ `FG%` + AST + TRB, data = nba_data_filtered)
summary(predictive_model)
##
## Call:
## lm(formula = PTS ~ `FG%` + AST + TRB, data = nba_data_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4517 -2.1162 -0.5069 1.8058 20.3415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.006375 0.294076 -0.022 0.983
## `FG%` 348.506763 71.345737 4.885 1.05e-06 ***
## AST 1.635263 0.020020 81.683 < 2e-16 ***
## TRB 1.158106 0.018098 63.992 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.656 on 10040 degrees of freedom
## Multiple R-squared: 0.6015, Adjusted R-squared: 0.6014
## F-statistic: 5052 on 3 and 10040 DF, p-value: < 2.2e-16
Model Fit:
Residual Standard Error: The residual standard error is 3.656, which indicates the typical deviation of the observed points scored from the points predicted by the model. A lower value would suggest a tighter fit.
R-squared: 0.6015 suggests that approximately 60.15% of the variability in points scored is explained by the model. This is a reasonably good fit for a model in a complex, dynamic setting like sports.
Adjusted R-squared: 0.6014 is very close to the R-squared value, indicating that the predictors are relevant and the model is not overly complex for the amount of data.
F-statistic: The very low p-value (< 2.2e-16) associated with the F-statistic confirms that the model is statistically significant and that at least one of the predictors has a significant relationship with the points scored.
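The fit metrics above follow from the residuals by standard formulas. As a sketch, they can be recomputed by hand and checked against `summary()`. The example uses the built-in `mtcars` data as a stand-in, not the NBA model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares
n   <- nrow(mtcars)
p   <- length(coef(fit)) - 1                     # number of predictors, excluding intercept

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
rse    <- sqrt(rss / (n - p - 1))                # residual standard error

c(r2 = r2, adj_r2 = adj_r2, rse = rse)
```

These match `summary(fit)$r.squared`, `summary(fit)$adj.r.squared`, and `summary(fit)$sigma`.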
library(shiny)
## Warning: package 'shiny' was built under R version 4.3.3
library(ggplot2)
library(dplyr)
ui = fluidPage(
titlePanel("NBA Player Statistics Explorer"),
sidebarLayout(
sidebarPanel(
selectInput("stat", "Choose a statistic:",
choices = c("Points" = "PTS", "Assists" = "AST", "Rebounds" = "TRB")),
sliderInput("number_of_games", "Minimum Number of Games:",
min = 0, max = 82, value = 20),
selectInput("modelType", "Choose Model Type:",
choices = c("Basic Model", "Interaction Model"))
),
mainPanel(
tabsetPanel(type = "tabs",
tabPanel("Plot", plotOutput("statPlot")),
tabPanel("Regression Output", verbatimTextOutput("modelOutput")),
tabPanel("Summary Statistics", tableOutput("summaryStats"))
)
)
)
)
server = function(input, output) {
# Dynamic plot based on user input
output$statPlot = renderPlot({
filtered_data = nba_data %>%
filter(G >= input$number_of_games)
# aes_string() is deprecated in ggplot2; the .data pronoun selects a column by name
ggplot(filtered_data, aes(x = Pos, y = .data[[input$stat]], fill = Pos)) +
geom_boxplot() +
labs(title = paste(input$stat, "Per Game by Player Position"), x = "Position", y = input$stat) +
theme_minimal()
})
output$modelOutput = renderPrint({
if (input$modelType == "Basic Model") {
summary(lm(PTS ~ `FG%` + AST + TRB, data = nba_data_filtered))
} else {
summary(lm(PTS ~ `FG%` + `3P%` + `2P%` + `eFG%` + `FT%` + AST + TRB + Pos, data = nba_data_filtered))
}
})
output$summaryStats = renderTable({
nba_data_filtered %>%
group_by(Pos) %>%
summarise(
Avg_Points = mean(PTS),
Avg_AST = mean(AST),
Avg_TRB = mean(TRB),
Avg_FG_Percentage = mean(`FG%`),
Avg_3P_Percentage = mean(`3P%`)
)
})
}
# Run the application
shinyApp(ui = ui, server = server)
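The second option in the app is labelled "Interaction Model", although its formula adds predictors rather than interaction terms. For comparison, here is a sketch of what an explicit interaction looks like in R's formula syntax, using the built-in `mtcars` data as a stand-in.

```r
# Additive model: separate effects of wt and hp
basic <- lm(mpg ~ wt + hp, data = mtcars)

# Interaction model: wt * hp expands to wt + hp + wt:hp
inter <- lm(mpg ~ wt * hp, data = mtcars)

coef_names <- names(coef(inter))
coef_names

# F test for whether the interaction term improves the fit
anova(basic, inter)
```

An analogous formula for this project might be `PTS ~ AST * TRB`, though whether such a term is useful here is an open question.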
The following summarizes the results and accuracy metrics for the predictive model, which assesses the impact of field goal percentage (FG%), assists (AST), and total rebounds (TRB) on points scored (PTS):
Model Effectiveness:
The model explains about 60% of the variance in points scored (R-squared = 0.6015). This is a substantial proportion, indicating that the model has a good level of predictive power considering the complexity and variability inherent in sports performance data.
Predictor Significance:
Assists (AST): Highly significant (p < 2e-16) with a coefficient of 1.635263, suggesting a strong positive association with scoring. Each additional assist per game is associated with an increase of approximately 1.64 points per game.
Total Rebounds (TRB): Also highly significant (p < 2e-16) with a coefficient of 1.158106, indicating that each additional rebound per game is associated with about 1.16 more points per game.
Field Goal Percentage (FG%): Significant (p = 1.05e-06) with a coefficient of 348.506763 on the rescaled FG% column, but with a much smaller t value (4.885) than AST or TRB, so its effect is estimated far less precisely.
Model Fit and Accuracy:
The residual standard error is 3.656 points per game on 10040 degrees of freedom.
Statistical Power and Reliability:
The F-statistic of 5052 on 3 and 10040 degrees of freedom (p < 2.2e-16) indicates that the model as a whole is statistically significant.
These results show that while assists and rebounds are good predictors of points scored, the role of field goal percentage might need further investigation, possibly including more data or examining other factors that might interact with or confound the relationship between shooting efficiency and scoring.
Initial Data Load and Examination of Positions
setwd("C:/Users/racha/Desktop/STAT 515")
nba_data = read.csv("nba.csv")
unique_positions = unique(nba_data$Pos)
print(unique_positions)
## [1] "PG" "SG" "SF" "C" "PF" "SG-SF" "SG-PG" "PF-C" "SF-SG"
## [10] "SF-PF" "PF-SF" "C-PF" "PG-SG" "PG-SF" "SG-PF" "SF-C"
library(dplyr)
library(tidyr)
na_count = nba_data %>%
filter(Season == "2019-20") %>%
summarise(NA_in_STL = sum(is.na(STL)))
print(na_count)
## NA_in_STL
## 1 0
stl_distribution = nba_data %>%
filter(Season == "2019-20", !is.na(STL)) %>%
summarise(
Min_STL = min(STL),
Max_STL = max(STL),
Mean_STL = mean(STL)
)
print(stl_distribution)
## Min_STL Max_STL Mean_STL
## 1 0 2.1 0.6160436
nba_filtered = nba_data %>%
filter(Season == "2019-20", !is.na(STL)) %>%
select(Pos, STL,Age)
nba_filtered$Pos = factor(nba_filtered$Pos)
str(nba_filtered)
## 'data.frame': 642 obs. of 3 variables:
## $ Pos: Factor w/ 14 levels "C","C-PF","PF",..: 1 3 1 1 12 12 1 6 3 12 ...
## $ STL: num 0.8 1.1 0.7 0 0.4 0.3 0.6 0.5 1 0 ...
## $ Age: int 26 22 34 23 21 24 21 27 29 26 ...
head(nba_filtered)
## Pos STL Age
## 1 C 0.8 26
## 2 PF 1.1 22
## 3 C 0.7 34
## 4 C 0.0 23
## 5 SG 0.4 21
## 6 SG 0.3 24
This code snippet checks data quality by counting missing values in the ‘STL’ column for the 2019-20 NBA season (there are none) and filtering on non-missing values as a safeguard. It also summarizes steals per game with the minimum, maximum, and mean, which offers a preliminary understanding of the data’s distribution. Finally, it keeps only the relevant columns and converts ‘Pos’ to a factor. Note that this reload keeps the combined position labels, so ‘Pos’ has 14 levels here rather than 5.
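The same kind of missing-value check can be run across every column at once. This is a sketch on a small made-up data frame. The column names mirror the ones above but the values are illustrative.

```r
# Toy data with a few missing entries
df <- data.frame(
  STL = c(0.8, NA, 0.7),
  Age = c(26, 22, NA),
  Pos = c("C", "PF", "C")
)

# Number of NA values in each column
na_per_col <- colSums(is.na(df))

# Rows with no missing values in any column
complete <- df[complete.cases(df), ]

na_per_col
complete
```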
library(ggplot2)
ggplot(nba_filtered, aes(x = Pos, y = STL, fill = Pos)) +
geom_boxplot() +
labs(title = "Distribution of Steals Per Game by Position",
x = "Position",
y = "Steals Per Game") +
theme_minimal()
anova_results = aov(STL ~ Pos, data = nba_filtered)
anova_summary = summary(anova_results)
print(anova_summary)
## Df Sum Sq Mean Sq F value Pr(>F)
## Pos 13 7.53 0.5794 3.729 9.3e-06 ***
## Residuals 628 97.59 0.1554
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This ANOVA test [2] statistically confirms that not all player positions contribute equally to steals, with some positions likely showing higher or lower average steals than others. This finding is crucial for understanding defensive roles and can inform coaching strategies, player development, and game tactics based on positional roles.
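The F value in the table is the ratio of the two mean squares, and the p-value is the upper tail of the F distribution. The sketch below recomputes both from the rounded figures printed above, so small rounding differences from the reported 3.729 are expected.

```r
# Values copied from the ANOVA table above
ms_pos <- 0.5794   # Mean Sq for Pos
ms_res <- 0.1554   # Mean Sq for Residuals
df_pos <- 13
df_res <- 628

f_value <- ms_pos / ms_res                                  # ratio of mean squares
p_value <- pf(f_value, df_pos, df_res, lower.tail = FALSE)  # upper-tail probability

c(F = f_value, p = p_value)
```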
tukey_results = TukeyHSD(anova_results)
print(tukey_results)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = STL ~ Pos, data = nba_filtered)
##
## $Pos
## diff lwr upr p adj
## C-PF-C -2.159091e-01 -1.548206908 1.11638873 0.9999994
## PF-C 1.693762e-02 -0.144941689 0.17881694 1.0000000
## PF-C-C 1.440909e-01 -0.460624141 0.74880596 0.9999402
## PF-SF-C -1.590909e-02 -0.790873492 0.75905531 1.0000000
## PG-C 2.927448e-01 0.118718479 0.46677103 0.0000020
## PG-SG-C -1.159091e-01 -1.448206908 1.21638873 1.0000000
## SF-C 1.877273e-01 0.016376197 0.35907835 0.0172595
## SF-C-C -2.159091e-01 -1.548206908 1.11638873 0.9999994
## SF-PF-C 1.507576e-01 -0.624206825 0.92572198 0.9999944
## SF-SG-C 1.840909e-01 -0.761520922 1.12970274 0.9999944
## SG-C 7.337662e-02 -0.087649349 0.23440260 0.9609481
## SG-PG-C -1.159091e-01 -1.061520922 0.82970274 1.0000000
## SG-SF-C -1.159091e-01 -1.448206908 1.21638873 1.0000000
## PF-C-PF 2.328467e-01 -1.099268293 1.56496172 0.9999985
## PF-C-C-PF 3.600000e-01 -1.093962095 1.81396210 0.9999074
## PF-SF-C-PF 2.000000e-01 -1.332610617 1.73261062 1.0000000
## PG-C-PF 5.086538e-01 -0.824991769 1.84229946 0.9916208
## PG-SG-C-PF 1.000000e-01 -1.777056994 1.97705699 1.0000000
## SF-C-PF 4.036364e-01 -0.929662805 1.73693553 0.9991607
## SF-C-C-PF -4.718448e-15 -1.877056994 1.87705699 1.0000000
## SF-PF-C-PF 3.666667e-01 -1.165943951 1.89927728 0.9999374
## SF-SG-C-PF 4.000000e-01 -1.225579041 2.02557904 0.9999137
## SG-C-PF 2.892857e-01 -1.042725865 1.62129729 0.9999795
## SG-PG-C-PF 1.000000e-01 -1.525579041 1.72557904 1.0000000
## SG-SF-C-PF 1.000000e-01 -1.777056994 1.97705699 1.0000000
## PF-C-PF 1.271533e-01 -0.477158896 0.73146547 0.9999859
## PF-SF-PF -3.284672e-02 -0.807496793 0.74180336 1.0000000
## PG-PF 2.758071e-01 0.103185971 0.44842829 0.0000094
## PG-SG-PF -1.328467e-01 -1.464961723 1.19926829 1.0000000
## SF-PF 1.707896e-01 0.000865809 0.34071349 0.0474094
## SF-C-PF -2.328467e-01 -1.564961723 1.09926829 0.9999985
## SF-PF-PF 1.338200e-01 -0.640830126 0.90847003 0.9999987
## SF-SG-PF 1.671533e-01 -0.778200964 1.11250753 0.9999982
## SG-PF 5.643900e-02 -0.103067376 0.21594537 0.9958987
## SG-PG-PF -1.328467e-01 -1.078200964 0.81250753 0.9999999
## SG-SF-PF -1.328467e-01 -1.464961723 1.19926829 1.0000000
## PF-SF-PF-C -1.600000e-01 -1.129308063 0.80930806 0.9999992
## PG-PF-C 1.486538e-01 -0.459024888 0.75633258 0.9999193
## PG-SG-PF-C -2.600000e-01 -1.713962095 1.19396210 0.9999980
## SF-PF-C 4.363636e-02 -0.563281663 0.65055439 1.0000000
## SF-C-PF-C -3.600000e-01 -1.813962095 1.09396210 0.9999074
## SF-PF-PF-C 6.666667e-03 -0.962641397 0.97597473 1.0000000
## SF-SG-PF-C 4.000000e-02 -1.070481893 1.15048189 1.0000000
## SG-PF-C -7.071429e-02 -0.674798438 0.53336987 1.0000000
## SG-PG-PF-C -2.600000e-01 -1.370481893 0.85048189 0.9999511
## SG-SF-PF-C -2.600000e-01 -1.713962095 1.19396210 0.9999980
## PG-PF-SF 3.086538e-01 -0.468625367 1.08593306 0.9878809
## PG-SG-PF-SF -1.000000e-01 -1.632610617 1.43261062 1.0000000
## SF-PF-SF 2.036364e-01 -0.573048271 0.98032100 0.9998239
## SF-C-PF-SF -2.000000e-01 -1.732610617 1.33261062 1.0000000
## SF-PF-PF-SF 1.666667e-01 -0.917052694 1.25038603 0.9999997
## SF-SG-PF-SF 2.000000e-01 -1.011635079 1.41163508 0.9999992
## SG-PF-SF 8.928571e-02 -0.685186489 0.86375792 1.0000000
## SG-PG-PF-SF -1.000000e-01 -1.311635079 1.11163508 1.0000000
## SG-SF-PF-SF -1.000000e-01 -1.632610617 1.43261062 1.0000000
## PG-SG-PG -4.086538e-01 -1.742299462 0.92499177 0.9990467
## SF-PG -1.050175e-01 -0.286550797 0.07651583 0.7991152
## SF-C-PG -5.086538e-01 -1.842299462 0.82499177 0.9916208
## SF-PF-PG -1.419872e-01 -0.919266393 0.63529203 0.9999974
## SF-SG-PG -1.086538e-01 -1.056163682 0.83885599 1.0000000
## SG-PG -2.193681e-01 -0.391189308 -0.04754696 0.0016213
## SG-PG-PG -4.086538e-01 -1.356163682 0.53885599 0.9751039
## SG-SF-PG -4.086538e-01 -1.742299462 0.92499177 0.9990467
## SF-PG-SG 3.036364e-01 -1.029662805 1.63693553 0.9999645
## SF-C-PG-SG -1.000000e-01 -1.977056994 1.77705699 1.0000000
## SF-PF-PG-SG 2.666667e-01 -1.265943951 1.79927728 0.9999986
## SF-SG-PG-SG 3.000000e-01 -1.325579041 1.92557904 0.9999970
## SG-PG-SG 1.892857e-01 -1.142725865 1.52129729 0.9999999
## SG-PG-PG-SG -2.775558e-15 -1.625579041 1.62557904 1.0000000
## SG-SF-PG-SG -6.661338e-16 -1.877056994 1.87705699 1.0000000
## SF-C-SF -4.036364e-01 -1.736935533 0.92966281 0.9991607
## SF-PF-SF -3.696970e-02 -0.813654331 0.73971494 1.0000000
## SF-SG-SF -3.636364e-03 -0.950658504 0.94338578 1.0000000
## SG-SF -1.143506e-01 -0.283461746 0.05476045 0.5729646
## SG-PG-SF -3.036364e-01 -1.250658504 0.64338578 0.9984732
## SG-SF-SF -3.036364e-01 -1.636935533 1.02966281 0.9999645
## SF-PF-SF-C 3.666667e-01 -1.165943951 1.89927728 0.9999374
## SF-SG-SF-C 4.000000e-01 -1.225579041 2.02557904 0.9999137
## SG-SF-C 2.892857e-01 -1.042725865 1.62129729 0.9999795
## SG-PG-SF-C 1.000000e-01 -1.525579041 1.72557904 1.0000000
## SG-SF-SF-C 1.000000e-01 -1.777056994 1.97705699 1.0000000
## SF-SG-SF-PF 3.333333e-02 -1.178301746 1.24496841 1.0000000
## SG-SF-PF -7.738095e-02 -0.851853156 0.69709125 1.0000000
## SG-PG-SF-PF -2.666667e-01 -1.478301746 0.94496841 0.9999761
## SG-SF-SF-PF -2.666667e-01 -1.799277284 1.26594395 0.9999986
## SG-SF-SG -1.107143e-01 -1.055922785 0.83449421 1.0000000
## SG-PG-SF-SG -3.000000e-01 -1.627279729 1.02727973 0.9999674
## SG-SF-SF-SG -3.000000e-01 -1.925579041 1.32557904 0.9999970
## SG-PG-SG -1.892857e-01 -1.134494213 0.75592278 0.9999921
## SG-SF-SG -1.892857e-01 -1.521297293 1.14272586 0.9999999
## SG-SF-SG-PG 2.109424e-15 -1.625579041 1.62557904 1.0000000
Purpose: The Tukey HSD [3] test is used following an ANOVA when the overall test indicates significant differences. It helps in conducting pairwise comparisons between all possible pairs of group means to specifically identify which positions differ from each other in terms of their average steals per game.
Utility: This test provides detailed insights by comparing every position against every other position, adjusting for multiple comparisons to maintain the overall type I error rate. It’s essential for understanding not just if differences exist, but where they exist.
Significant Comparisons:
PG and C: The point guard (PG) position shows a significantly higher number of steals compared to the center (C) position, with a positive difference and a very low p-value (0.0000020), indicating strong statistical significance.
SF and C: Small forwards (SF) also show a significantly higher number of steals compared to centers, with a p-value of 0.0172595. This is below 0.05 but is much weaker evidence than for the point guard comparisons.
PG and PF: Another significant finding is between point guards and power forwards (PF), where PGs have more steals, indicated by another low p-value (0.0000094).
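The significant pairs listed above can be pulled out of a Tukey HSD result programmatically rather than read off the printed table. The following is a minimal sketch on synthetic data, not the report's actual analysis: the data frame `toy`, its group means, and the 0.05 cutoff are assumptions made for illustration.

```r
# Synthetic stand-in for the steals data (hypothetical values, not the NBA dataset)
set.seed(1)
toy = data.frame(
  Pos = factor(rep(c("C", "PG", "SF"), each = 30)),
  STL = c(rnorm(30, 0.5, 0.2), rnorm(30, 1.0, 0.2), rnorm(30, 0.8, 0.2))
)

# One-way ANOVA followed by Tukey's HSD on the Pos factor
fit = aov(STL ~ Pos, data = toy)
tukey = TukeyHSD(fit)

# tukey$Pos is a matrix with columns diff, lwr, upr, and "p adj";
# keep only the pairs whose adjusted p-value is below 0.05
sig = tukey$Pos[tukey$Pos[, "p adj"] < 0.05, , drop = FALSE]
sig
```

Each row name has the form `A-B`, and `diff` is the mean of A minus the mean of B, so a positive `diff` in row `PG-C` means point guards average more steals than centers.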
library(ggplot2)
library(plotly)
steals_summary = nba_filtered %>%
  dplyr::group_by(Pos) %>%
  dplyr::summarise(
    Mean_STL = mean(STL),
    SE_STL = sd(STL) / sqrt(n()) # Standard error of the mean
  )
p = ggplot(steals_summary, aes(x = Pos, y = Mean_STL, fill = Pos)) +
  geom_bar(stat = "identity", position = position_dodge(), width = 0.7) +
  geom_errorbar(aes(ymin = Mean_STL - SE_STL, ymax = Mean_STL + SE_STL),
                width = 0.2, position = position_dodge(0.7)) +
  labs(title = "Average Steals Per Game by Position",
       x = "Position",
       y = "Average Steals Per Game") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for clarity
plotly_obj = ggplotly(p)
plotly_obj
Variability and Trends: The bar heights indicate the mean steals per game for each position, showing clear variability across different positions. Positions such as PG (Point Guard) and SF (Small Forward) typically have higher averages, which might reflect their roles involving more perimeter defense and opportunities for steals.
Error Bars: The error bars represent the standard error of the mean (the standard deviation divided by the square root of the number of players), so they show how precisely each position's mean is estimated rather than the raw spread of player values. A longer bar can reflect more variability among players at that position, fewer players at that position, or both. A shorter bar indicates a more precisely estimated mean.
Strategic Insights: Coaches and team analysts can use this data to focus on training and strategic planning. For example, strengthening the defensive skills of players in positions with lower average steals might improve overall team performance.
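To make the interpretation of the error bars concrete, the sketch below computes the standard error of the mean for two synthetic groups that share the same underlying spread but differ in size. The group sizes and the helper name `se` are assumptions chosen for illustration and are not taken from the dataset.

```r
# Two hypothetical groups drawn from the same distribution (mean 1, sd 0.4)
set.seed(42)
small_group = rnorm(25, mean = 1, sd = 0.4)
large_group = rnorm(400, mean = 1, sd = 0.4)

# Standard error of the mean: sample sd divided by sqrt of the sample size
se = function(x) sd(x) / sqrt(length(x))

c(small = se(small_group), large = se(large_group))
```

The larger group gets a much shorter error bar even though both groups vary by about the same amount, which is why bar length alone should not be read as player-to-player consistency.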
setwd("C:/Users/racha/Desktop/STAT 515")
nba_data = read.csv("nba.csv")
library(dplyr)
library(ggplot2)
library(randomForest)
library(caret)
library(reshape2)
# Get the unique players
unique_players = unique(nba_data$Player)
# Create a data frame with the unique players
unique_nba_data = nba_data[!duplicated(nba_data$Player), ]
# Position comparison
# Create a bar plot of player positions
ggplot(nba_data, aes(x = Pos)) +
  geom_bar(fill = "blue") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
  theme_minimal() +
  labs(x = "Position", y = "Count", title = "Breakdown of Total Player Positions")
# Create a bar plot of player positions for the unique players
ggplot(unique_nba_data, aes(x = Pos)) +
  geom_bar(fill = "red") +
  geom_text(stat = "count", aes(label = after_stat(count)), vjust = -1) +
  theme_minimal() +
  labs(x = "Position", y = "Count", title = "Breakdown of Unique Player Positions")
#splitting the data into train and test
set.seed(44)
trainIndex = createDataPartition(nba_data$Pos, p = .7, list = FALSE)
## Warning in createDataPartition(nba_data$Pos, p = 0.7, list = FALSE): Some
## classes have a single record ( PG-SF, SF-C ) and these will be selected for the
## sample
train = nba_data[trainIndex,]
test = nba_data[-trainIndex,]
train <- na.omit(train)
train$Pos = as.factor(train$Pos)
test$Pos = as.factor(test$Pos)
#summary(nba_data)
# predict 'Pos' using all other variables except 'Player', 'Tm', 'Season', and 'MVP'
model = randomForest(Pos ~ . - Player - Tm - Season - MVP, data = train)
#model
# Use the model to predict player positions in the test set
predictions = predict(model, newdata = test)
# Set the levels of the factor in the test data to match those of the training data
test$Pos <- factor(test$Pos, levels = levels(train$Pos))
# Finally, compare these predictions to the actual positions
cm = confusionMatrix(predictions, test$Pos)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction C C-PF PF PF-C PF-SF PG PG-SF PG-SG SF SF-C SF-PF SF-SG SG
## C 325 4 101 0 0 0 0 0 8 0 0 0 3
## C-PF 1 0 0 0 0 0 0 0 0 0 0 0 0
## PF 129 2 444 3 2 2 0 0 121 0 1 0 29
## PF-C 0 0 0 0 0 0 0 0 0 0 0 0 0
## PF-SF 0 0 1 0 0 0 0 0 0 0 0 0 0
## PG 0 0 3 0 0 692 0 7 24 0 0 1 127
## PG-SF 0 0 0 0 0 0 0 0 0 0 0 0 0
## PG-SG 0 0 0 0 0 1 0 0 0 0 0 0 0
## SF 23 0 136 1 4 16 0 0 413 0 5 3 146
## SF-C 0 0 0 0 0 0 0 0 0 0 0 0 0
## SF-PF 0 0 0 0 0 0 0 0 0 0 0 0 0
## SF-SG 0 0 0 0 0 0 0 0 1 0 0 0 0
## SG 5 0 27 0 0 104 0 1 164 0 0 5 525
## SG-PF 0 0 0 0 0 0 0 0 0 0 0 0 0
## SG-PG 0 0 0 0 0 0 0 0 0 0 0 0 0
## SG-SF 0 0 0 0 0 0 0 0 0 0 0 0 0
## Reference
## Prediction SG-PF SG-PG SG-SF
## C 0 0 0
## C-PF 0 0 0
## PF 0 0 1
## PF-C 0 0 0
## PF-SF 0 0 0
## PG 0 5 1
## PG-SF 0 0 0
## PG-SG 0 0 0
## SF 0 1 3
## SF-C 0 0 0
## SF-PF 0 0 0
## SF-SG 0 0 0
## SG 1 2 4
## SG-PF 0 0 0
## SG-PG 0 0 0
## SG-SF 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.6612
## 95% CI : (0.6456, 0.6766)
## No Information Rate : 0.2288
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5746
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: C Class: C-PF Class: PF Class: PF-C Class: PF-SF
## Sensitivity 0.67288 0.0000000 0.6236 0.000000 0.0000000
## Specificity 0.96312 0.9997239 0.9005 1.000000 0.9997239
## Pos Pred Value 0.73696 0.0000000 0.6049 NaN 0.0000000
## Neg Pred Value 0.95042 0.9983457 0.9074 0.998897 0.9983457
## Prevalence 0.13313 0.0016538 0.1963 0.001103 0.0016538
## Detection Rate 0.08958 0.0000000 0.1224 0.000000 0.0000000
## Detection Prevalence 0.12155 0.0002756 0.2023 0.000000 0.0002756
## Balanced Accuracy 0.81800 0.4998620 0.7621 0.500000 0.4998620
## Class: PG Class: PG-SF Class: PG-SG Class: SF Class: SF-C
## Sensitivity 0.8491 NA 0.0000000 0.5650 NA
## Specificity 0.9403 1 0.9997238 0.8833 1
## Pos Pred Value 0.8047 NA 0.0000000 0.5499 NA
## Neg Pred Value 0.9556 NA 0.9977943 0.8895 NA
## Prevalence 0.2246 0 0.0022051 0.2015 0
## Detection Rate 0.1907 0 0.0000000 0.1138 0
## Detection Prevalence 0.2370 0 0.0002756 0.2070 0
## Balanced Accuracy 0.8947 NA 0.4998619 0.7242 NA
## Class: SF-PF Class: SF-SG Class: SG Class: SG-PF
## Sensitivity 0.000000 0.0000000 0.6325 0.0000000
## Specificity 1.000000 0.9997237 0.8881 1.0000000
## Pos Pred Value NaN 0.0000000 0.6265 NaN
## Neg Pred Value 0.998346 0.9975186 0.8907 0.9997244
## Prevalence 0.001654 0.0024807 0.2288 0.0002756
## Detection Rate 0.000000 0.0000000 0.1447 0.0000000
## Detection Prevalence 0.000000 0.0002756 0.2310 0.0000000
## Balanced Accuracy 0.500000 0.4998618 0.7603 0.5000000
## Class: SG-PG Class: SG-SF
## Sensitivity 0.000000 0.000000
## Specificity 1.000000 1.000000
## Pos Pred Value NaN NaN
## Neg Pred Value 0.997795 0.997519
## Prevalence 0.002205 0.002481
## Detection Rate 0.000000 0.000000
## Detection Prevalence 0.000000 0.000000
## Balanced Accuracy 0.500000 0.500000
# Plot the confusion matrix as a heatmap
# 'cm' is the caret confusionMatrix object created above
cm_matrix = as.matrix(cm$table)
# Convert the confusion matrix to a long-format data frame
cm_df = as.data.frame(as.table(cm$table))
# Melt the data frame (the count column 'Freq' becomes 'value')
cm_melt = melt(cm_df)
## Using Prediction, Reference as id variables
# Create the heatmap
ggplot(data = cm_melt, aes(x = Reference, y = Prediction, fill = value)) +
  geom_tile() +
  geom_text(aes(label = value), vjust = 0.5, color = "black") +
  scale_fill_gradient(low = "white", high = "red") +
  theme_minimal()
The bar charts give an overview of the dataset. The first bar chart counts every row in the dataset by position, so a player who appears in more than one row is counted once per row. The second bar chart counts each unique player once, using the first row recorded for that player, and is included for reference.
The model predicts a player's position by majority vote across the decision trees in the forest: each tree casts a “vote” for a position, and the position with the most votes becomes the model's prediction. The confusion matrix and its accompanying statistics summarize how well the model performs. The overall accuracy is about 66.12% (95% CI: 64.56% to 67.66%), meaning the model correctly predicted the position for about 66.12% of the rows in the test set, well above the no-information rate of 22.88%.
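The two ideas in the paragraph above, majority voting and overall accuracy, can be sketched in a few lines of base R. The tree votes and the 3x3 confusion matrix below are made-up toy values, not output from the fitted model.

```r
# Hypothetical votes cast by 7 trees for a single player
votes = c("PG", "SG", "PG", "SF", "PG", "SG", "PG")
vote_counts = table(votes)
# The predicted class is the one with the most votes
predicted = names(vote_counts)[which.max(vote_counts)]
predicted

# Toy confusion matrix: rows are predictions, columns are true positions
toy_cm = matrix(c(8, 2, 1,
                  1, 7, 2,
                  0, 3, 6),
                nrow = 3, byrow = TRUE,
                dimnames = list(Prediction = c("PG", "SG", "SF"),
                                Reference = c("PG", "SG", "SF")))
# Accuracy is the share of cases on the diagonal (correct predictions)
accuracy = sum(diag(toy_cm)) / sum(toy_cm)
accuracy
```

For a fitted `randomForest` model, the per-class vote shares should be obtainable with `predict(model, newdata = test, type = "vote")`, and applying the same diagonal-over-total calculation to `cm$table` should reproduce the accuracy reported by `confusionMatrix`.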